Draft

Showcase with the Dominick’s dataset

Dominick’s offers a rich, real-world scanner dataset (focusing on food sold in stores).

Overview of the data

The Dominick’s dataset is a rich and useful resource for validating how diagnostics and analytics can be designed for scanner data. Since Mehrhoff (2018) first proposed using it for price index research, it has been widely used within price statistics, from the empirical tests of decompositions of multilateral methods by Webster and Tarnow-Mordi (2019) to the analysis of multilateral methods by Lamboray (2021). We use this dataset because it provides a realistic example of the data that national statistical offices (NSOs) need to work with.

The dataset is available from the Chicago Booth School website, pre-categorized into separate product and movement files (Kilts Center for Marketing [2013] 2018; see footnote 1). This makes it simple to work with, as we can assume that each category represents an elementary index prior to its integration into the CPI alongside other retailer datasets and field-collected data.
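To make the split between product and movement files concrete, the following is a minimal sketch of how the two could be combined into unit values per homogeneous product and period. It uses small invented data frames rather than the real files. The column names (UPC, NITEM, WEEK, MOVE, QTY, PRICE, OK) and the revenue formula follow our reading of the data manual and should be checked against it. This is an illustration only, not the processing used elsewhere in this document.

```r
# Toy stand-ins for one category's product (UPC) and movement files.
# Column names follow our reading of the data manual; values are invented.
upc <- data.frame(
  UPC     = c(1001, 1002, 1003),
  DESCRIP = c("APPLE JUICE 64OZ", "APPLE JUICE 64OZ", "GRAPE JUICE 46OZ"),
  NITEM   = c(501, 501, 502)
)
movement <- data.frame(
  STORE = c(2, 2, 2, 2, 2, 2),
  UPC   = c(1001, 1002, 1003, 1001, 1002, 1003),
  WEEK  = c(1, 1, 1, 2, 2, 2),
  MOVE  = c(10, 5, 8, 12, 0, 6),   # units sold
  QTY   = c(1, 1, 2, 1, 1, 2),     # bundle size the shelf price refers to
  PRICE = c(2.00, 2.20, 3.00, 1.80, 2.20, 3.20),
  OK    = c(1, 1, 1, 1, 1, 1)      # 1 = record not flagged as suspect
)

# Keep unflagged records with positive sales and prices.
movement <- movement[movement$OK == 1 & movement$MOVE > 0 & movement$PRICE > 0, ]

# Attach product attributes to each movement record.
sales <- merge(movement, upc, by = "UPC")

# PRICE refers to a bundle of QTY units, so revenue is PRICE * MOVE / QTY.
sales$revenue <- sales$PRICE * sales$MOVE / sales$QTY

# Unit value per homogeneous product (NITEM) and period (WEEK):
# total revenue divided by total units sold.
unit_values <- aggregate(cbind(revenue, MOVE) ~ NITEM + WEEK,
                         data = sales, FUN = sum)
unit_values$unit_value <- unit_values$revenue / unit_values$MOVE
names(unit_values)[names(unit_values) == "MOVE"] <- "quantity"

print(unit_values)
```

Grouping by NITEM rather than UPC treats barcodes that share an item number as one homogeneous product, which is the same grouping key passed to the aggregation helper in the example index code.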

Example indices

Before drilling into specific time periods to trial and showcase the various diagnostic dashboards, it helps to look at an overall index from which those periods can be selected. Figure 1 shows the index for the Bottled Juices category over several years.

# Load the project helper functions used below
source("../../src/dominicks_utils.R")

# plotly provides plot_ly(), layout() and the %>% pipe
library(plotly)

# Aggregate the Bottled Juices ('bjc') category by item and reference period
ird <- homogenous_product_aggregation(
  category_name = 'bjc',
  data_dir = '../../data/processed',
  time_sample = c(1, 2),
  group_by_parameters = c('NITEM', 'REF_PERIOD'),
  window = list(
    "start" = "1990-01-01",
    "end"   = "1997-03-01"
  )
)

ccdi <- spliced_CCDI(ird)
ccdi_df <- data.frame(
  period = names(ccdi),
  score = as.numeric(ccdi)
)

fig <- plot_ly(ccdi_df, 
               x = ~period, 
               y = ~score, 
               type = 'scatter', 
               mode = 'lines',
               text = ~paste("Period:", period, "<br>Index:", score),
               hoverinfo = 'text') %>%
  layout(title = "CCDI with mean splice (Bottled Juices category)",
         xaxis = list(title = "Time Period", type = 'category'), # Ensures order is maintained
         yaxis = list(title = "Index", tickformat = ".1f")) # Fixed-point ticks, one decimal place

fig
Figure 1: Bottled Juices CCDI (GEKS-Törnqvist) example

Several time periods within this window are suitable candidates for targeted tests.
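For readers less familiar with the index plotted in Figure 1, the sketch below shows one way a CCDI (GEKS-Törnqvist) index can be computed over a single window, using an invented toy panel. It is not the implementation behind spliced_CCDI, which lives in the project's helper file and is not shown here. The names panel, tornqvist and ccdi_window are hypothetical, and the splicing step that links successive windows is deliberately left out.

```r
# Toy panel: prices (unit values) and quantities for three products over
# four periods. All numbers are invented for illustration.
panel <- data.frame(
  period   = rep(1:4, each = 3),
  product  = rep(c("A", "B", "C"), times = 4),
  price    = c(2.0, 1.5, 3.0,  1.8, 1.6, 3.0,  2.1, 1.5, 3.3,  2.2, 1.7, 3.1),
  quantity = c(10, 8, 4,       14, 7, 4,       9, 8, 3,        9, 7, 4)
)

# Bilateral Tornqvist index from period s to period t over matched products:
# a geometric mean of price relatives weighted by average expenditure shares.
tornqvist <- function(df, s, t) {
  a <- df[df$period == s, ]
  b <- df[df$period == t, ]
  m <- merge(a, b, by = "product", suffixes = c(".s", ".t"))
  share_s <- m$price.s * m$quantity.s / sum(m$price.s * m$quantity.s)
  share_t <- m$price.t * m$quantity.t / sum(m$price.t * m$quantity.t)
  exp(sum(0.5 * (share_s + share_t) * log(m$price.t / m$price.s)))
}

# GEKS-Tornqvist (CCDI) over one window: for each period t, take the
# geometric mean, over every link period l in the window, of the indirect
# comparison base -> l -> t.
ccdi_window <- function(df) {
  periods <- sort(unique(df$period))
  base <- periods[1]
  index <- sapply(periods, function(t) {
    links <- sapply(periods, function(l) {
      tornqvist(df, base, l) * tornqvist(df, l, t)
    })
    exp(mean(log(links)))
  })
  setNames(index, periods)
}

ccdi_toy <- ccdi_window(panel)
print(round(ccdi_toy, 4))
```

Because every period in the window serves as a link, the resulting series is transitive within the window: the index from the first period to any later period does not depend on the path taken. A production index would additionally need a splicing method, such as the mean splice named in the plot title, to extend the series as the window moves forward.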

Dominick’s-specific data flow

To put the processing in context, Figure 2 shows how the data flows in this research setting and how it would flow if Dominick’s were a real production dataset.

Figure 2: Visual flow of data processing for Dominick’s as if it were a real production dataset

References

Kilts Center for Marketing. (2013) 2018. Dominick’s Data Manual.
Lamboray, Claude. 2021. “Index Compilation Techniques for Scanner Data: An Overview.” Group of Experts on Consumer Price Indices.
Mehrhoff, Jens. 2018. “Promoting the Use of a Publically Available Scanner Data Set in Price Index Research and for Capacity Building.” Manuscript, European Commission. https://bit.ly/2ZBUbg9.
Webster, Michael, and Rory C Tarnow-Mordi. 2019. “Decomposing Multilateral Price Indexes into the Contributions of Individual Commodities.” Journal of Official Statistics 35 (2): 461–86.

Footnotes

  1. See the catalogue record in the Price Statistics Open Data catalogue for more information about the dataset.